Replication for Efficiency and Fault Tolerance in a DSM System

Author

  • Anne-Marie KERMARREC
Abstract

Distributed Shared Memory (DSM) systems implemented on a network of workstations (NOW) have become a convenient alternative to shared-memory architectures for executing long-running parallel applications. However, such architectures are susceptible to failures. This paper presents the design and implementation of a recoverable DSM (RDSM) based on a backward error recovery (BER) mechanism. The RDSM's design focuses on exploiting replication of data for both fault tolerance and efficiency. The RDSM has been implemented on a NOW, and a performance evaluation shows the benefits of exploiting both types of replication to design an efficient, scalable and low-cost recoverable DSM.

Similar resources

ICARE: Combining Efficiency and High-availability in a DSM System

In light of the increasing throughput of local area networks, networks of workstations (NOW) that provide a distributed shared memory (DSM) have become a convenient alternative to parallel architectures for parallel scientific applications. ICARE is a recoverable DSM based on backward error recovery which is implemented on top of an experimental ATM platform running the CHORUS m...

Improving the Efficiency of Replication for Highly Reliable Systems

Fault tolerance must be provided to increase system reliability. Combining efficiency with fault tolerance is a difficult task: fault tolerance requires the use of redundancy, while efficiency requires its elimination. Several fault tolerance techniques have been proposed in the literature to manage the redundancy existing in the system in order to provide fault tolerance. These te...

Fault tolerance and configurability in DSM coherence protocols

With the advent of large networks and the demand to have uninterrupted service, computer systems need to be more robust and fault tolerant. There are numerous ways to implement fault tolerance and recovery. A central concept in all these methods is the requirement for replicated data for high data availability. We believe that a protocol must not only provide replication, but do so at low opera...

Improving the PALBIMM scheduling algorithm for fault tolerance in cloud computing

Cloud computing is the latest technology that involves distributed computation over the Internet. It meets the needs of users through sharing resources and using virtual technology. The workflow user applications refer to a set of tasks to be processed within the cloud environment. Scheduling algorithms have a lot to do with the efficiency of cloud computing environments through selection of su...

Fault-Tolerant Distributed-Shared-Memory on a Broadcast-Based Interconnection Network

The Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) is a low-latency, high-bandwidth interconnection network which directly links arbitrary pairs of processor nodes without contention, and can efficiently interconnect over one hundred nodes. Each node has a dedicated output channel and an array of receivers, with one receiver dedicated to every other node’s output channel. The SOME-...

Publication date: 2007